Parallelize the `TableVectorizer` column-wise #592

LeoGrin · 2023-06-12T16:42:18Z

…_vec_parallelism

LeoGrin · 2023-06-13T09:55:13Z

It works on the latest sklearn version (1.2.2), but not on the older ones (I think it would be quite painful to make it work). Do we still want to support only recent sklearn versions? @GaelVaroquaux

skrub/_table_vectorizer.py

GaelVaroquaux · 2023-06-13T09:58:14Z

It works on the latest sklearn version (1.2.2), [...] Do we still want to support only recent sklearn versions?

+1 for supporting only sklearn 1.2.2 and later (can we try 1.2.1 ?).

Can someone update our requirements and CI build matrix? @LilianBoulard ?

LeoGrin · 2023-06-13T12:40:56Z

+1 for supporting only sklearn 1.2.2 and later (can we try 1.2.1 ?).

Cool ! It also works for 1.2.1

LeoGrin · 2023-06-13T12:45:49Z

Something I don't really like is that the transformers_ attribute is a bit weird for high-cardinality columns:

[('datetime', DatetimeEncoder(), ['date_of_stop', 'time_of_stop']),
 ('low_card_cat',
  OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
  ['agency',
   'subagency',
   'accident',
   'belts',
    ...
   'race',
   'gender',
   'driver_state',
   'dl_state',
   'arrest_type']),
 ('high_card_cat', GapEncoder(n_components=30), ['seqid']),
 ('high_card_cat', GapEncoder(n_components=30), ['description']),
 ('high_card_cat', GapEncoder(n_components=30), ['location']),
 ('high_card_cat', GapEncoder(n_components=30), ['search_reason_for_stop']),
 ('high_card_cat', GapEncoder(n_components=30), ['make']),
 ('high_card_cat', GapEncoder(n_components=30), ['model']),
 ('high_card_cat', GapEncoder(n_components=30), ['charge']),
 ('high_card_cat', GapEncoder(n_components=30), ['driver_city']),
 ('high_card_cat', GapEncoder(n_components=30), ['geolocation']),
 ('remainder',
  'passthrough',
  ['latitude', 'longitude', 'year', 'contributed_to_accident'])]

but I'm not sure we can avoid it with this solution.
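To make the listing above concrete, here is a minimal sketch of how splitting one multi-column entry into single-column entries produces the repeated `high_card_cat` names. `split_per_column` and `FakeGapEncoder` are hypothetical stand-ins introduced for illustration, not skrub's actual code, and `copy.deepcopy` stands in for scikit-learn's `clone`.

```python
import copy


class FakeGapEncoder:
    """Hypothetical stand-in for a per-column encoder such as GapEncoder."""

    def __init__(self, n_components=30):
        self.n_components = n_components

    def __repr__(self):
        return f"FakeGapEncoder(n_components={self.n_components})"


def split_per_column(transformers, names_to_split):
    """Split selected (name, transformer, columns) entries into one per column.

    Each new entry gets its own copy of the transformer, so fitting one
    column cannot overwrite the state learned on another.
    """
    result = []
    for name, transformer, columns in transformers:
        if name in names_to_split and len(columns) > 1:
            for column in columns:
                result.append((name, copy.deepcopy(transformer), [column]))
        else:
            result.append((name, transformer, columns))
    return result


transformers = [
    ("low_card_cat", "passthrough", ["agency", "gender"]),
    ("high_card_cat", FakeGapEncoder(), ["seqid", "description", "location"]),
]
split = split_per_column(transformers, names_to_split={"high_card_cat"})
for entry in split:
    print(entry)
```

The name is repeated once per column because every single-column entry keeps the name of the entry it was split from.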

LeoGrin · 2023-06-16T11:46:04Z

@glemaitre do you have thoughts on this? Do you think it could be worth implementing the parallelisation directly in sklearn?

glemaitre · 2023-06-16T11:57:17Z

I am not convinced that the strategy is beneficial for all types of transformers.

To give a concrete example, I think it would be faster to run a StandardScaler on a matrix of several columns instead of parallelizing StandardScaler for each column of the matrix.

So the proposed strategy would make sense if internally to the transformer, there is already an explicit for loop. But in this case, I would expect to get a single instance and pass a matrix instead of having multiple instances for each column. If the parallelization is beneficial for this transformer, I would then expect the transformer to internally parallelize.

I might be missing some details to grasp the reason to dispatch a column per transformer here.
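A small sketch of the StandardScaler point, under stated assumptions: plain NumPy standardization is used here in place of `StandardScaler` to keep the example dependency-light, and the function names are made up for illustration. It only checks that the two paths agree numerically; it makes no timing claim.

```python
import numpy as np


def standardize_matrix(X):
    """Standardize every column in one vectorized call."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def standardize_per_column(X):
    """Standardize each column in its own call, as a per-column dispatch would."""
    columns = [
        (X[:, j] - X[:, j].mean()) / X[:, j].std() for j in range(X.shape[1])
    ]
    return np.column_stack(columns)


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1_000, 8))

whole = standardize_matrix(X)
per_column = standardize_per_column(X)
same_result = np.allclose(whole, per_column)
```

Since the whole-matrix version is already vectorized, splitting it into one task per column would mostly add dispatch overhead, which is presumably why the per-column strategy only pays off for transformers that loop over columns internally.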

LeoGrin · 2023-06-16T12:08:45Z

Thanks for your answer @glemaitre !

I am not convinced that the strategy is beneficial for all types of transformers.
To give a concrete example, I think it would be faster to run a StandardScaler on a matrix of several columns instead of parallelizing StandardScaler for each column of the matrix.

Agree! The idea would be to have a list of transformer classes for which it is useful.

So the proposed strategy would make sense if internally to the transformer, there is already an explicit for loop. But in this case, I would expect to get a single instance and pass a matrix instead of having multiple instances for each column. If the parallelization is beneficial for this transformer, I would then expect the transformer to internally parallelize.
I might be missing some details to grasp the reason to dispatch a column per transformer here.

But right now the transformers used inside the ColumnTransformer are not internally parallelized, right? My first instinct was indeed to parallelize these transformers, but it would create nested parallelism. To avoid the nesting, we discussed favoring the inner loop and removing the ColumnTransformer's own parallelism, but it was deemed too surprising for the user*. This is how we ended up dispatching a column per transformer (see discussion in #586), but there may be better solutions.

*Now that I think of it, it is indeed surprising to set the TableVectorizer's n_jobs to 1 to remove the inherited ColumnTransformer parallelism, but it may work if implemented directly in sklearn? Maybe with an option to either parallelize each transformer internally and run the transformers sequentially, or run all transformers in parallel without parallelizing them internally.
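The either/or option floated in the footnote could be sketched as a small helper that decides which level receives the worker budget. Both `allocate_jobs` and its `parallelize` argument are hypothetical names for illustration; nothing like this exists in sklearn or skrub as far as this thread shows.

```python
def allocate_jobs(n_jobs, parallelize="outer"):
    """Return (outer_n_jobs, inner_n_jobs) so only one level runs in parallel.

    "outer": run the transformers in parallel, each one sequential inside.
    "inner": run the transformers one after another, each one parallel inside.
    """
    if parallelize == "outer":
        return n_jobs, 1
    if parallelize == "inner":
        return 1, n_jobs
    raise ValueError(
        f"parallelize must be 'outer' or 'inner', got {parallelize!r}"
    )


outer_jobs, inner_jobs = allocate_jobs(4, parallelize="outer")
print(outer_jobs, inner_jobs)
```

Either choice avoids nesting because one of the two levels always gets `n_jobs=1`.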

glemaitre · 2023-06-16T12:29:17Z

Yep, I see the issue regarding nested parallelism and oversubscription. But I don't know if we should special case or instead better handle the nested parallelism in joblib (which is a far harder task to do).

Having nested parallelism should be somehow manageable, but it requires the user to properly set the n_jobs of the different transformers, and there is not a good default that just makes things work smoothly. But I don't know if this is worth more than having some internal hackish way to go around the problem.

I just thought now that UX-wise, having multiple instances of the same transformer for each column will not be user-friendly when it comes to inspection. We could always rebuild a single instance but it smells hackish from far away (and then there is the same issue regarding the parallelism at transform time).
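For the inspection concern, a milder alternative to rebuilding a single instance is to regroup the per-column entries by name. The sketch below is an assumption-laden illustration: `group_by_name` is a hypothetical helper, and plain strings stand in for fitted transformers. It keeps every instance instead of merging them, since merging fitted state is transformer-specific.

```python
def group_by_name(transformers):
    """Collapse repeated names into one entry with all instances and columns."""
    grouped = {}
    for name, transformer, columns in transformers:
        instances, all_columns = grouped.setdefault(name, ([], []))
        instances.append(transformer)
        all_columns.extend(columns)
    return [
        (name, instances, all_columns)
        for name, (instances, all_columns) in grouped.items()
    ]


# Strings stand in for fitted transformer instances.
split_entries = [
    ("datetime", "DatetimeEncoder()", ["date_of_stop", "time_of_stop"]),
    ("high_card_cat", "GapEncoder(n_components=30)", ["seqid"]),
    ("high_card_cat", "GapEncoder(n_components=30)", ["description"]),
]
grouped = group_by_name(split_entries)
for entry in grouped:
    print(entry)
```

This only tidies the view; the parallelism question at transform time raised above is untouched.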

LeoGrin · 2023-06-16T12:42:16Z

I just thought now that UX-wise, having multiple instances of the same transformer for each column will not be user-friendly when it comes to inspection. We could always rebuild a single instance but it smells hackish from far away (and then there is the same issue regarding the parallelism at transform time).

Yes I agree

Having nested parallelism should be somehow manageable, but it requires the user to properly set the n_jobs of the different transformers, and there is not a good default that just makes things work smoothly. But I don't know if this is worth more than having some internal hackish way to go around the problem.

If the user provides an n_jobs for the TableVectorizer but doesn't touch the default internal transformers, do you think there is a way to set the n_jobs of the different transformers to make it work nicely? Or it is what you mean by "there is not a good default"?

glemaitre · 2023-06-16T12:55:57Z

Or it is what you mean by "there is not a good default"?

It is what I meant :). joblib already tries to not get some catastrophic over-subscription in the case of nested parallelism and I don't think that we can do better (at least in an obvious manner). But @GaelVaroquaux and @ogrisel know better than me the internals on this subject.

Vincent-Maladiere

Hey @LeoGrin, here are a couple of additional remarks :)

skrub/_table_vectorizer.py

Vincent-Maladiere

Hey @LeoGrin, here are a couple of additional remarks :)

Vincent-Maladiere

We're almost done, here are some final comments

CHANGES.rst

skrub/_table_vectorizer.py

skrub/_minhash_encoder.py

Vincent-Maladiere

Thanks, @LeoGrin, that's a great feature! Let's add some missing docstrings, then LGTM :)

Vincent-Maladiere · 2023-09-01T08:39:40Z

You have conflicts with the main branch

LeoGrin · 2023-09-01T12:20:16Z

Thanks for all the comments @Vincent-Maladiere !

Vincent-Maladiere

some last docstring comments

skrub/_table_vectorizer.py

jovan-stojanovic

Thanks, LGTM!

LeoGrin added 7 commits June 10, 2023 15:03

fall back to pandas if no datetime format is found

eb0a020

change changelog

f813fbb

Merge branch 'main' of https://github.com/skrub-data/skrub into table_vec_parallelism

5d5e862

first working version

d9a374f

add tests

ffa9755

update changelog

5bcb6dd

typo

c005a49

LeoGrin changed the title ~~Table vec parallelism~~ Parallelise TableVectorizer on the columns Jun 12, 2023

LeoGrin and others added 2 commits June 13, 2023 10:27

copy transformer when split between columns to avoid conflict

85b7f9c

Merge branch 'main' into table_vec_parallelism

26e66bd

GaelVaroquaux reviewed Jun 13, 2023

skrub/_table_vectorizer.py (outdated, resolved)

LeoGrin added 2 commits June 13, 2023 13:49

update changelog

c3706c7

fix bug with repeated transformer

4147549

LilianBoulard assigned LeoGrin Jun 15, 2023

LilianBoulard added the enhancement New feature or request label Jun 15, 2023

LeoGrin mentioned this pull request Jun 19, 2023

Column-wise parallelism for Column Transformer scikit-learn/scikit-learn#26614

Open

LilianBoulard changed the title ~~Parallelise TableVectorizer on the columns~~ Parallelize the TableVectorizer column-wise Jun 19, 2023

LeoGrin added 3 commits June 22, 2023 18:32

split and merge

7771eb6

more tests

8acca83

merge with main

8dd62cb

Merge remote-tracking branch 'upstream/main' into table_vec_parallelism

9157451

Vincent-Maladiere reviewed Aug 26, 2023

LeoGrin and others added 6 commits August 26, 2023 18:08

Apply suggestions from code review

99b3a9b

Co-authored-by: Vincent M <[email protected]>

_parallel_on_columns

769df3a

explain transformers vs transformers_

28cfb4c

Merge branch 'table_vec_parallelism' of https://github.com/LeoGrin/skrub into table_vec_parallelism

e338fe1

split merge into two functions

9aec058

add tests to check that splitting doesn't prevent resetting transformers

72dd3d4

LeoGrin requested a review from Vincent-Maladiere August 29, 2023 10:48

Vincent-Maladiere reviewed Aug 30, 2023

CHANGES.rst (outdated, resolved)

skrub/_table_vectorizer.py (outdated, resolved)

skrub/_table_vectorizer.py (resolved)

skrub/_table_vectorizer.py (resolved)

LeoGrin and others added 3 commits August 30, 2023 14:24

Apply suggestions from code review

2d5fd3a

Co-authored-by: Vincent M <[email protected]>

combine merge_unfitted and merge_fitted and move _split outside of class

2eb6f5b

don't return empty new_transformer_to_input_indices

7f7b524

LeoGrin requested a review from Vincent-Maladiere August 30, 2023 13:46

remove future warning

ebcca20

Vincent-Maladiere reviewed Aug 30, 2023

skrub/_minhash_encoder.py (resolved)

Vincent-Maladiere approved these changes Aug 30, 2023

LeoGrin added 4 commits September 1, 2023 13:22

remove future warning

d8e10ce

add docstrings

25d854b

fix merge

0d381d0

fix test

1cef43d

Vincent-Maladiere reviewed Sep 1, 2023

skrub/_table_vectorizer.py (outdated, resolved)

skrub/_table_vectorizer.py (outdated, resolved)

LeoGrin and others added 2 commits September 1, 2023 15:26

Update skrub/_table_vectorizer.py

eabdb8b

Co-authored-by: Vincent M <[email protected]>

fix type hints

134a413

jovan-stojanovic approved these changes Sep 11, 2023

LeoGrin merged commit f981247 into skrub-data:main Sep 20, 2023
21 checks passed

glemaitre mentioned this pull request Sep 27, 2023

MAINT use composition in TableVectorizer #675

Closed
